- Cluster analysis (kmeans and hierarchical)
- Reading raster files
- Principal component analysis
Cluster analysis assignes each data point to one of k clusters, thereby maximmizing the between-cluster variance and minimizing the within-cluster variance.
There are two principal types of cluster algorithms: partionating algorithms (e.g. kmeans) and hierarchical algorithms.
A kmeans cluster analysis can be performed using the kmeans() function:
data(iris) iris_kmeans <- kmeans(x = iris[, c(1:4)], centers = 3)
## K-means clustering with 3 clusters of sizes 33, 96, 21 ## ## Cluster means: ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## 1 5.175758 3.624242 1.472727 0.2727273 ## 2 6.314583 2.895833 4.973958 1.7031250 ## 3 4.738095 2.904762 1.790476 0.3523810 ## ## Clustering vector: ## [1] 1 3 3 3 1 1 1 1 3 3 1 1 3 3 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 3 3 1 1 1 3 ## [36] 1 1 1 3 1 1 3 3 1 1 3 1 3 1 1 2 2 2 2 2 2 2 3 2 2 3 2 2 2 2 2 2 2 2 2 ## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2 2 2 2 ## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ## [141] 2 2 2 2 2 2 2 2 2 2 ## ## Within cluster sum of squares by cluster: ## [1] 6.432121 118.651875 17.669524 ## (between_SS / total_SS = 79.0 %) ## ## Available components: ## ## [1] "cluster" "centers" "totss" "withinss" ## [5] "tot.withinss" "betweenss" "size" "iter" ## [9] "ifault"
We can plot the cluster solution using ggplot:
iris$cluster3 <- iris_kmeans$cluster p <- ggplot(iris, aes(Sepal.Length, y = Sepal.Width, col = factor(cluster3))) + geom_point()
And compare it to the acutal species classification:
table(iris$cluster3, iris$Species)
## ## setosa versicolor virginica ## 1 33 0 0 ## 2 0 46 50 ## 3 17 4 0
More/less clusters lead to different solutions:
iris$cluster2 <- kmeans(x = iris[, c(1:4)], centers = 2)$cluster iris$cluster4 <- kmeans(x = iris[, c(1:4)], centers = 4)$cluster iris$cluster5 <- kmeans(x = iris[, c(1:4)], centers = 5)$cluster iris$cluster6 <- kmeans(x = iris[, c(1:4)], centers = 6)$cluster p <- iris %>% gather(key = k, value = cluster, -(Sepal.Length:Species)) %>% ggplot(., aes(Sepal.Length, y = Sepal.Width, col = factor(cluster))) + geom_point() + facet_wrap(~k, ncol = 6)
Hierarchical clustering can be performed using the hclust() function. However, we first need to calcuate the distances (similarities) between each data point pair:
dist <- dist(iris[, 1:4])
There are several methods for calculating the distance (e.g., euclidean, manhatten, …), see ?dist(). With the distance matrix we can hierarchically cluster all data points:
iris_hc <- hclust(dist)
And plot i using the ggdrndro package:
library(ggdendro) p <- ggdendrogram(iris_hc)
See https://cran.r-project.org/web/packages/ggdendro/vignettes/ggdendro.html for further information.
For todays exercise we are going to use data that is stored in a raster stack, that is a stack of gridded rasters with matching spatial resolution and extent.
Reading a raster stack can be done using the stack() function in the raster package:
library(raster)
clim <- stack("Data/climate.envi")
Raster stacks can be of any format (geoTiff, ENVI, grid, bsq, etc.) and with or without spatial reference. For reading a single raster instead of a stack, you can use the raster() function.
clim
## class : RasterStack ## dimensions : 709, 1306, 925954, 4 (nrow, ncol, ncell, nlayers) ## resolution : 0.008333333, 0.008333333 (x, y) ## extent : 16.90833, 27.79167, 44.33333, 50.24167 (xmin, xmax, ymin, ymax) ## coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 ## names : Band.1, Band.2, Band.3, Band.4 ## min values : 0.08333333, -52.83333333, -26.41666667, 38.41666667 ## max values : 173.2500, 77.2500, 124.6667, 146.6667
names(clim) <- c("Tmin", "Tmean", "Tmax", "Precip")
Plotting the raster is easily achieved by the plot() command. However, you have to select a layer for plotting:
plot(clim, y = 1)
Rasters can have millions of pixels, which can make computations very slow. To deal with this issue, we are going to sample 10,000 pixels from the raster:
clim_samp <- sampleRandom(clim, size = 10000) clim_samp <- as.data.frame(clim_samp) # sampleRandom returns a matrix clim_samp[1:5, ]
## Tmin Tmean Tmax Precip ## 1 134.7500 32.66667 83.41666 54.33333 ## 2 158.0000 58.16667 107.75000 44.00000 ## 3 113.3333 25.16667 69.00000 63.41667 ## 4 141.0833 32.58333 86.66666 51.58333 ## 5 118.2500 30.83333 74.33334 55.16667
library(GGally) p <- clim_samp %>% sample_n(., size = 100) %>% ggpairs(.)
There are two functions in R for performaing a PCA: princomp() and prcomp(). While the former function calculates the eigenvalue on the correlation/covariance matrix using the function eigen(), the latter function uses singular value decomposition, which numerically more stable. We here are going to use prcomp().
pca <- prcomp(clim_samp, scale. = TRUE)
You should set the scale. argument to TRUE, to scale the data matrix to unit variances. The default is scale. = FALSE. You can also use scale() to standardize the original data matrix and then enter the standardized data matrix to prcomp.
The prcomp object stores the rotations (eigenvectors) which define the contribution of each variable to a component:
pca$rotation
## PC1 PC2 PC3 PC4 ## Tmin -0.5274767 -0.2337721 0.678194066 -4.551612e-01 ## Tmean -0.5271176 -0.2277391 -0.733096738 -3.644874e-01 ## Tmax -0.5320480 -0.2331100 0.051063234 8.123898e-01 ## Precip 0.4010490 -0.9160487 -0.003811388 3.914667e-05
The eigenvalues are also stored in the prcomp object:
pca$sdev
## [1] 1.85194821 0.73096506 0.18965623 0.00290114
And can be easily translated into variance explained:
var_exp <- data.frame(pc = 1:4,
var_exp = pca$sdev^2 / sum(pca$sdev^2))
var_exp$var_exp_cumsum <- cumsum(var_exp$var_exp)
p <- ggplot(var_exp) +
geom_bar(aes(x = pc, y = var_exp), stat = "identity") +
geom_line(aes(x = pc, y = var_exp_cumsum))
We can now apply the PCA transformation to the whole raster stack:
clim_pca <- raster::predict(clim, pca, index = 1:4)
Let's combine PCA and kmeans:
pca_values <- as.data.frame(clim_pca) pca_kmeans <- kmeans(pca_values[, 1:3], centers = 5) kmean_raster <- raster(clim_pca) values(kmean_raster) <- pca_kmeans$cluster plot(kmean_raster)